Effects of combining feedback and hypothesis-testing on the quality of simulated child sexual abuse interviews with avatars among Chinese university students

Previous research has shown that simulation training using avatars with repeated feedback improves child sexual abuse interview quality. The present study added a hypothesis-testing intervention and examined whether the combination of two interventions, feedback and hypothesis-testing, would improve interview quality compared to no intervention and to either intervention alone. Eighty-one Chinese university students were randomly assigned to a control, feedback, hypothesis-testing, or combined feedback and hypothesis-testing group and conducted five simulated child sexual abuse interviews online. Depending on the assigned group, feedback on the outcome of the cases and the question types used in the interview was provided after each interview, and/or the participants built hypotheses based on preliminary case information before each interview. The combined interventions group and the feedback group showed a higher proportion of recommended questions and correct details from the 3rd interview onward compared to the hypothesis-testing and control groups. The difference between groups in the number of correct conclusions was not significant. Hypothesis-testing alone exacerbated the use of non-recommended questions over time. The results show that hypothesis-testing may negatively affect the question types used, but not when combined with feedback. The potential reasons for hypothesis-testing alone not being effective and the differences between the present and previous studies are discussed.


Introduction
In 2020, China's Procuratorate prosecuted 15,365 persons for rape of minors, 5,880 for molesting children, and 1,461 persons for forcibly molesting and insulting minors, with increases of 19%, 15%, and 12%, respectively, compared with 2019 [1]. The seriousness and the increasing numbers of child sexual abuse (CSA) cases underline the urgency of establishing evidence-based training for conducting such investigations. Improving the quality of investigative interviews is of central importance because, in many CSA cases, the child's statement serves as the most crucial piece of evidence, since other evidence is rarely available [2][3][4]. Human memory is a reconstructive process subject to situational influences [5], making the quality of statements provided by both adults [6][7][8] and children [9,10] not merely a function of the interviewee's memories but also of the interview process. Therefore, in investigative settings where the accuracy of memory reports is of great importance, interviewers should avoid exerting social influences that could contaminate the memories of the interviewees. This is especially true for child interviewing since children are, in general, more susceptible to suggestion than adults [11][12][13][14]. Accordingly, adopting open-ended questions and avoiding closed-ended questions when interviewing children is an important recommendation [15][16][17]. In reality, however, the quality of investigative interviews with children remains a concern in various countries because of the continuing heavy use of closed questions, which may result in distortions of testimony, including contradictions and inaccurate details [18][19][20][21]. Given that the quality of investigative interviews is a worldwide issue and that the development of evidence-based training programs in China is still in its infancy, it is reasonable to expect that China faces similar problems in CSA investigations.
To increase and maintain the use of open-ended questions in investigative interviews, different training programs have been developed and implemented. Nevertheless, there is a gap between knowing what the best practice is and actually adopting the appropriate question types during an interview. Programs that provide merely theoretical knowledge are not efficient in improving interview quality [22][23][24]. Lamb et al. [25] showed that interview quality could be improved by giving the interviewer a structured interview protocol, including the types of questions to use at different stages, combined with detailed feedback after interviews. However, this solution entails significant costs, and the training effects tend to disappear soon after feedback is discontinued [25]. Besides, when real cases are used for training, the only type of feedback that can be provided to the interviewer concerns the questions used. No feedback can usually be given regarding whether the interviewer actually found out what had happened.
As a solution, a serious gaming training protocol that uses simulated CSA interviews with computer-generated avatars (hereafter, Avatar Training) has been proposed [26]. Avatar Training, designed to promote the quality of investigative CSA interviews, creates a realistic interviewing setting in which the interviewer questions a child avatar concerning suspected sexual abuse and gathers information from the avatar as if it were a real child. In this program, interviewers can receive feedback not only on the question types they used after each simulated interview but also on the outcome (i.e., whether the interviewer arrived at the correct conclusion). A series of experiments [27][28][29][30] has shown that simulated interviews combined with feedback on the outcomes and the question types resulted in more open-ended questions and fewer closed questions being used compared to simulated interviews alone. Importantly, this training effect transferred to interviews with real children about a mock event sharing features with sexual abuse [31] as well as to actual investigative interviews in criminal cases [32]. Regarding the applicability of Avatar Training, previous studies conducted both in Western contexts [26,29,30,31] and in Japan [27,28,33] have supported its effectiveness.
The adoption of a less suggestive questioning style, however, does not adequately address another important issue in investigative interviews, namely confirmation bias. Confirmation bias refers to a universal human tendency to look for and interpret evidence in a way that is consistent with one's prior beliefs [34,35]. Literature on information acquisition has clearly shown that people tend to seek evidence confirming their initial belief [36][37][38][39] and to ignore information that goes against it [40,41], with potentially negative effects on professionals investigating CSA allegations [12][42][43][44]. Hence, pre-interview beliefs held by interviewers are central and may lead to confirmation bias, resulting in the use of closed-ended questions [45,46] that may endanger the reliability of responses elicited from children and, in extreme cases, lead the interviewer to an incorrect conclusion [47][48][49]. Besides, previous experiments using the child avatars have also found that a preliminary assumption of abuse resulted in more frequent use of non-recommended question types [35]. However, even if suggestive questions are absent, open questions may still focus on a particular assumption about what has happened. Focusing solely on question types may, therefore, be inadequate to address the issue of confirmation bias in CSA investigative interviews.
To address the problem of confirmation bias, we decided to investigate the impact of adding a hypothesis-building intervention to the Avatar Training. Formulating alternative hypotheses prior to interviewing a child has been suggested multiple times as a potential technique to counteract confirmation bias and to reduce false positive conclusions [42][50][51][52]. Hypothesis-building here refers to the practice where, after a careful assessment of background information but before the interview, interviewers formulate a series of alternative hypotheses about the case, with particular attention paid to how the allegation came about, risk factors for abuse, and potential risks of pre-interview suggestive influences or misunderstandings [50]. Using a hypothesis-testing approach is considered central to avoiding attempts to confirm only a single preliminary assumption in a CSA investigation, as it allows the formation and evaluation of alternative hypotheses that should cover all plausible scenarios leading to the abuse allegation [53,54]. It has been argued that adopting a hypothesis-testing approach should especially decrease the risk of false-positive errors [55], given that in a CSA investigation a hypothesis of abuse is a necessary premise and may remain the only hypothesis if conscious efforts to add other hypotheses are not undertaken. Thus, in the current study, we integrated a hypothesis-testing procedure into the Avatar Training.
However, there is some tension between hypothesis-testing and the use of open-ended questions, the capstone of a good quality interview. Hypothesis-testing can be construed as a confirmatory task in which questions are formed to confirm or rule out hypotheses [38,56]. Effective hypothesis-testing, when not considering children's cognitive limitations [57,58], including the tendency of children to give affirmative responses even if they do not know the answer to the question [12,59,60], typically includes the use of closed-ended questions [38,61]. These may be closed directive or option-posing questions, whilst open-ended questions are not often used since this type of question may not seem an ideal way of testing alternative hypotheses. Thus, in terms of question types, the information-acquiring process in hypothesis-testing may, to some extent, be indistinguishable from low-quality interviewing. Therefore, the challenge lies in achieving good question quality while being able to test hypotheses. An emphasis on hypothesis-testing perhaps always drives question quality down. This could especially be the case if both hypothesis-testing and best practice in terms of question types are learned simultaneously, which might cause cognitive overload for the learner [62][63][64]. However, it is also possible that hypothesis-testing counteracts a sole focus on a CSA hypothesis and that this will result in more open information seeking and, consequently, more use of open questions. It may also be that a combination of a hypothesis-testing approach with feedback on question types would not result in any negative effect on interview quality when measured in question types. That is, interviewers may be able to execute the hypothesis-testing process to a satisfactory level primarily through the use of open questions in this situation.
The present study was the first to bring the two approaches together. We examined not only the effect of teaching hypothesis-testing alone on the quality of CSA investigative interviewing but also the effects of a combination of hypothesis-testing and feedback on question types. The validity of a Chinese version of the Avatar Training approach was assessed as well. The present study tested the following hypotheses:
1. Hypothesis-testing (i.e., HT) was expected to improve interviewing quality (i.e., question types used), and consequently the quality of the information derived from the Avatars.
2. Feedback on outcome and question types (i.e., Feedback) was expected to improve interviewing quality, and consequently the quality of the information derived from the Avatars.
3. The combination of HT and Feedback was expected to result in more improvement in interview quality than either intervention alone.
4. All three intervention groups were expected to perform better than the control group, which received neither intervention (i.e., no Feedback and no HT).
5. The total number of hypotheses formulated by the participants, the number of non-CSA hypotheses, and the ratio (more non-CSA compared to CSA hypotheses) were expected to predict both better question quality and better quality of information derived from the Avatars (these analyses were limited to the HT conditions).
6. Improvement in interviewing quality was expected to mediate the impact of the HT manipulation on interview quality variables (eliciting details).
We also explored how the experimental manipulations and the question types used impacted the Avatars' perceived reliability, contradictoriness and suggestibility.

Pre-registration
The current study was pre-registered on the Open Science Framework (OSF): https://osf.io/7ytnh. After the data analysis had begun, there were a few deviations from our original analysis plan regarding the hypothesis-testing variables. Because the ratio of non-CSA to CSA hypotheses becomes infinite when no CSA hypotheses are formulated, we instead used the number of CSA hypotheses as a proportion of the total number of hypotheses, as well as the number of non-CSA hypotheses as a proportion of the total number of hypotheses. Apart from this, we added a measure of the number of non-CSA hypotheses subtracted from the number of CSA hypotheses.
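The revised hypothesis measures can be illustrated with a minimal sketch. The function and key names below are hypothetical and are not taken from the study materials; the sketch only shows why proportions of the total stay finite when a participant formulates no CSA hypotheses, whereas the original non-CSA to CSA ratio would require division by zero.

```python
def hypothesis_measures(n_csa, n_non_csa):
    """Summarise one participant's hypotheses for one case (illustrative only)."""
    total = n_csa + n_non_csa
    if total == 0:
        # Assumption: with no hypotheses at all, the proportions are undefined.
        return {"prop_csa": None, "prop_non_csa": None, "difference": 0}
    return {
        # Proportions of the total remain finite even when n_csa == 0.
        "prop_csa": n_csa / total,
        "prop_non_csa": n_non_csa / total,
        # Number of non-CSA hypotheses subtracted from the number of CSA hypotheses.
        "difference": n_csa - n_non_csa,
    }


# A participant with three non-CSA hypotheses and no CSA hypothesis:
print(hypothesis_measures(0, 3))
```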
The main statistical models were adjusted in the following manner: First, we chose maximum likelihood estimation (MLE) with conventional standard errors in lieu of bivariate Pearson correlations to be able to assess associations between variables at both the interview and participant levels. Second, we used Cohen's d effect size in lieu of Holm's method to quantify the magnitude of the differences in planned comparisons. Third, the Generalized Linear Model (GLM) included Time as a potential predictor of the correctness of conclusions in this study (all details are described in the Statistical analyses section). We also ran exploratory analyses to further examine the impact of the experimental manipulation on participants' assessment of information from avatars.
[...] missing. The conditions for recruiting the participants were: 1) 18 years old or older, 2) enrolled university student, and 3) native speaker of Chinese. Informed consent was obtained from all participants via Qualtrics for inclusion in the study. Individuals who were younger than 18 years old or non-native speakers of Chinese were excluded from this research. None of the participants had experience in CSA investigations or parenting experience. Eight (9.9%) had taken a training course in child interviewing, and five (6.2%) had experience interviewing children. The board of research ethics at New York University Shanghai approved the study before the data collection commenced (2020-008).

Design
The present study employed a 2 (Feedback on Outcome and Question-type (Feedback): Present vs. Not present: between subjects) * 2 (Hypothesis-testing (HT): A practice of HT and HT before each interview vs. Neither practice of HT nor HT before each interview: between subjects) * 5 (Time: From 1st to 5th interviews: within-subjects) mixed design. Participants were randomly allocated into the control (n = 20), the feedback only (n = 19), the hypothesis-testing only (n = 22), or the combination of feedback and hypothesis-testing (n = 20) conditions.

Avatar training
The simulated interview application used for the experiment included avatars varying in their age (4 or 6 years old), gender (male or female), and presence of abuse (yes or no). Each combination of these characteristics was represented by two avatars, resulting in a total of 16 avatars for the eight combinations. The Avatar Training adopted an answer selection algorithm designed by Pompedda et al. [26], which is based on findings on the impact of question types on the responses elicited from children [12,48,65]. During each session, the questions asked by the interviewer were coded by the operator into question types, and the answer selection algorithm then chose the avatar's response and automatically showed the relevant video clip to the interviewer.

Answer selection
Avatar Training is equipped with an algorithm that gives responses (correct, irrelevant, or wrong) to the questions asked by the interviewers with predefined probabilities. The predefined probabilities of giving answers to open-ended or closed questions are based on empirical research on real children's behavioral patterns during interviews [26]. For example, a 4-year-old avatar's probability of giving a "yes" response to a closed question differs from that of a 6-year-old avatar, reflecting children's cognitive development. Each avatar has nine relevant details and nine neutral details stored in its pre-existing "memory". The relevant details contain information that can either substantiate the presence of abuse or offer a non-abuse explanation for the event. The neutral details are dispensable details included to make the simulation more realistic. In response to each recommended question, the avatar algorithm provided one of these relevant or neutral details according to the pre-assigned probability. Relevant or neutral information was provided only when recommended questions were asked. The presentation of relevant and neutral details followed a set order, with the last four relevant details containing the crucial context needed to find the truth of each case. Interviewers could acquire incorrect details that conflicted with the predefined memories when they asked non-recommended questions. Under this setting, the avatars occasionally provided incorrect responses to non-recommended questions (e.g., giving an affirmative answer to an option-posing question when this information is absent in the avatar's memory), mimicking children's responses to such questions in real life.
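The logic of the answer selection described above can be sketched as follows. This is only an illustration under stated assumptions and not the algorithm of Pompedda et al. [26]: the class name, the probability parameters, the single ordered detail list, and the handling of non-recommended questions that do not produce a wrong answer are all hypothetical, and the actual predefined probabilities are not reported in this section.

```python
import random


class AvatarSketch:
    """Toy avatar: recommended questions may release the next stored detail,
    non-recommended questions may occasionally produce a wrong detail."""

    def __init__(self, details, p_detail, p_wrong, seed=None):
        self.details = list(details)  # stored details in their set order
        self.next_index = 0           # position of the next unrevealed detail
        self.p_detail = p_detail      # hypothetical probability of releasing a detail
        self.p_wrong = p_wrong        # hypothetical probability of a wrong answer
        self.rng = random.Random(seed)

    def answer(self, question_type):
        """Return (response_kind, detail) for an operator-coded question type."""
        if question_type == "recommended":
            has_detail = self.next_index < len(self.details)
            if has_detail and self.rng.random() < self.p_detail:
                detail = self.details[self.next_index]
                self.next_index += 1
                return ("correct", detail)
            return ("irrelevant", None)
        # Non-recommended questions never release stored details.
        if self.rng.random() < self.p_wrong:
            return ("wrong", None)
        return ("irrelevant", None)


avatar = AvatarSketch(["detail 1", "detail 2"], p_detail=0.8, p_wrong=0.3, seed=42)
for coded_type in ["recommended", "non-recommended", "recommended"]:
    print(coded_type, avatar.answer(coded_type))
```

Keeping the question coding outside the class mirrors the procedure in which a human operator classifies each question before the algorithm selects a response.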

Procedure
The experiment was carried out via WeChat and Zoom, taking approximately 2 hours for each participant. The probability and magnitude of harm/discomfort anticipated as a result of participating in this study were not greater than those ordinarily encountered in daily life or during the performance of routine physical or psychological examinations or tests. Participants were informed about the potential mental distress they might experience in the course of the experiment and were advised to consider carefully whether they wanted to participate if they had had abusive experiences, since they might experience more distress. After completing the informed consent and demographic information forms, participants in all four conditions read the guidelines for the correct questioning style and answered two comprehension checks (e.g., 'If the child provides a detail regarding the alleged situation, for example 'he punched me' which is the best question to ask? A Did it hurt? B Who punched you? C Was it your father?'). Participants were asked to read the instructions again and re-enter their answers if they gave an incorrect answer to one or both of the comprehension checks. Participants in either the HT only condition or the HT+Feedback condition were also asked to read the guidelines for hypothesizing and to practice hypothesis-testing using two mock cases.
During the interview rounds, after reading the background information of each alleged CSA case, participants in the HT condition and the HT+Feedback condition were first asked to complete the form of hypotheses regarding the presence or absence of child sexual abuse in the case (see Table 1), and then answered two questions about their preliminary impression of the case before the interview: (1) the presence of the abuse ("present" or "absent") and (2) confidence in their assessment on a 6-point scale ("50%" to "100%"). For participants in the Feedback only condition and the control condition, there was no hypothesis generation procedure, and they were asked to answer the two questions about their preliminary belief right after reading the background information. In all conditions, participants were instructed to verbally ask questions to gather information from the avatar in order to determine the presence or absence of sexual abuse.
Each participant conducted five interviews with five avatars randomly selected from the sixteen avatars. Each interview lasted up to 10 minutes. The participants could terminate the interview before 10 min had passed if they believed they had already gained enough information from the avatar. After each interview, participants were asked to answer three questions regarding the reliability of the information from the avatar: (1) How reliable is the information given by the avatar you just interviewed? (from "not reliable", 0%, to "highly reliable", 100%); (2) To what extent was the information given by the avatar you just interviewed contradictory? (from "not contradictory", 0%, to "highly contradictory", 100%); (3) How suggestible (i.e., going along with leading questions) was the avatar you just interviewed? (from "not suggestible", 0%, to "highly suggestible", 100%). Participants were also asked to judge the presence or absence of child sexual abuse for a second time based on the information elicited from the avatar.
Different from the preliminary assumption, participants needed to first decide the presence or absence of the abuse ("present" or "absent") and then offer a detailed account of the event. If the answer to the first question was "present", participants needed to provide an account of where, how, and by whom the avatar was abused. If the judgment was "absent", participants needed to provide an account of how the unfounded CSA allegation arose. The participants in the Feedback conditions received feedback on the case outcome and question type (two recommended questions and two non-recommended questions) after each interview. The feedback on question type contained the categorizations of the questions as well as their impacts on the reliability of children's statements.

Example of a mock case
Nicholas is a 6-year-old boy who lives alone with his mother. Nicholas, who has attended elementary school for 6 months, has always had problems with submitting to the authority of teachers. He is described as a restless child who is never seated, and he argues frequently with schoolmates. The only place where Nicholas seems to be at ease is during dance lessons, which his mother pushed him to start at the age of 4. Ever since the new teacher Richard arrived at the dance school, it seems that his passion for dance has increased. Richard and Nicholas have established a very good relationship and Nicholas often goes along with other children to Richard's home for private lessons. After one of these lessons he tells his mother that he was alone with Richard and that he had danced naked with him. His mother, alarmed by this fact, contacted the teacher to ask for an explanation. Richard said that this is absolutely not true, and that it was a fantasy Nicholas had, and that Nicholas had asked him to get naked and dance, but he absolutely said no to him.

Statistical analyses
Many of the hypotheses formulated by the participants were in the format "A was sexually abused by B" and "A was not sexually abused by B". That is, different from what we expected, participants in these cases did not actually do any in-depth analysis of the specific scenarios that could underlie the suspicion of abuse, which is essential for the hypothesis-testing approach. Hence, we reported the analyses using the original hypothesis variables (i.e., the numbers of CSA and non-CSA hypotheses) per the pre-registered plan but also examined Hypothesis 5 using the number of CSA and non-CSA hypotheses after excluding hypotheses of this unanalytical type (named: revised Hypothesis related variables).
Bland and Altman [66] introduced the correlation analysis of repeated observations. When looking at whether the increase (or decrease) of one variable within each interview is associated with the variability of the other, we removed the differences among participants and calculated within-subject correlations. When looking at whether participants with a higher (or lower) value of one variable tend to have a higher (or lower) value of the other, we calculated the between-subject correlations by using the averages of the variables of each participant. Correlation analyses were modeled at both within- and between-subject levels simultaneously by using the multilevel.cor function, which relies on the R package lavaan [67]. To look at (1) the validity of the avatars' answering algorithms and (2) the associations between participants' assessment of the information from avatars and their questioning skills as well as the types of details elicited from avatars in each interview, we focused only on the within-subject level (interview level in our case) correlation coefficients to assess whether the correctness of conclusions, the types of questions, the types of details, and the assessment of the avatar in terms of reliability, contradictoriness, and suggestibility are associated with each other. Afterward, we looked at both the within- and between-subject levels for the hypothesis-related variables, types of questions, as well as the number of details to examine Hypothesis 5. Maximum likelihood estimation (MLE) with conventional standard errors was used as the estimator to provide the best fitted linear model. The correlation coefficients at the interview level represent the linear associations between two measures after controlling for the variance between participants.
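The distinction between the two levels can be illustrated with a simplified sketch. It is not the maximum-likelihood multilevel model used in the analyses: it only applies the basic decomposition described above, correlating participant-centred scores for the within-subject level and participant means for the between-subject level. The function names and the toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def within_between_correlations(records):
    """records: iterable of (participant_id, x, y), one row per interview."""
    by_participant = defaultdict(list)
    for pid, x, y in records:
        by_participant[pid].append((x, y))

    centred_x, centred_y, mean_x, mean_y = [], [], [], []
    for observations in by_participant.values():
        mx = sum(x for x, _ in observations) / len(observations)
        my = sum(y for _, y in observations) / len(observations)
        mean_x.append(mx)
        mean_y.append(my)
        # Removing each participant's own mean leaves interview-level variation.
        centred_x.extend(x - mx for x, _ in observations)
        centred_y.extend(y - my for _, y in observations)

    return pearson(centred_x, centred_y), pearson(mean_x, mean_y)


# Toy data: y falls with x inside each participant, but participants with
# higher average x also have higher average y.
toy = [("A", 1, 12), ("A", 2, 11), ("A", 3, 10),
       ("B", 4, 22), ("B", 5, 21), ("B", 6, 20),
       ("C", 7, 32), ("C", 8, 31), ("C", 9, 30)]
r_within, r_between = within_between_correlations(toy)
print(r_within, r_between)
```

In the toy data the two levels have opposite signs, which is why pooling all interviews into a single bivariate correlation could be misleading.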
Given that participants who were allocated into either the HT or the HT+Feedback conditions built hypotheses before the first interview, while the Control and the Feedback conditions did not receive any intervention before the first interview, we compared baseline performance according to whether participants built hypotheses or not using a series of t-tests. In addition, for exploratory purposes, we also compared differences in hypothesis-testing between the HT condition and the HT+Feedback condition.
In the main analyses, we conducted a series of 2 (Feedback: Present vs. Not present: between subjects) * 2 (Hypothesis-testing (HT): Training in HT and HT before each interview vs. Neither training in HT nor HT before each interview: between subjects) * 5 (Interview number: From 1st to 5th interviews: within-subjects) three-way mixed multivariate analyses of variance (MANOVA) to test the effect of Avatar training on the types of questions, types of details, and the assessment of information from Avatars by using the R package MANOVA.RM [68]. We chose the modified ANOVA-type statistic (MATS) with the parametric bootstrap procedure, which is suitable for non-normal multivariate models. Compared to the Wald-type statistic (WTS), MATS provides robust test statistics without the requirement of extremely large sample sizes [69]. For the analyses at the univariate level, we used the R package afex [70] to conduct a series of 2 (Feedback: Present vs. Not present: between subjects) * 2 (Hypothesis-testing (HT): Training in HT and HT before each interview vs. Neither training in HT nor HT before each interview: between subjects) * 5 (Interview number: From 1st to 5th interviews: within-subjects) three-way mixed analyses of variance (ANOVA) to test the effect of Avatar training on the types of questions, types of details, and the assessment of information from Avatars. Mauchly's Test of Sphericity revealed that all dependent variables except the proportion of recommended questions, the number of wrong details, and the perceived contradictoriness of information from Avatars satisfied the sphericity assumption since their epsilon (ε) values were > 0.75. The Greenhouse-Geisser correction was used to correct the degrees of freedom of the proportion of recommended questions and the number of wrong details since their epsilon (ε) values were < 0.75 [71]. Planned pairwise comparisons were conducted by using the R package emmeans [72].
[...] automatically loaded package graphics and stats [73]. Mean Differences (MD) were calculated by using the pairs() function with the Tukey method to adjust for multiple comparisons to identify at which interview the differences were significant. When looking at the dichotomous variables, (1) correctness of conclusions and (2) correctness of complete conclusions, we used the R package lme4 [74] to fit a generalized linear mixed model, and the 95% confidence intervals of the estimates were calculated by using the Wald method.
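How a sphericity correction such as Greenhouse-Geisser changes the reported degrees of freedom can be shown with a short sketch. The function name is hypothetical and the epsilon value in the example is illustrative rather than taken from the study. The uncorrected values assume the design described above (5 interviews, 81 participants, 4 conditions) and apply to the effect of Interview number and to its interactions with the two-level between-subjects factors.

```python
def corrected_df(epsilon, n_levels, n_participants, n_groups):
    """Scale the within-subject degrees of freedom by a sphericity epsilon."""
    df_effect = n_levels - 1                                  # 5 interviews -> 4
    df_error = (n_levels - 1) * (n_participants - n_groups)   # 4 * 77 = 308
    return epsilon * df_effect, epsilon * df_error


# Uncorrected degrees of freedom (epsilon = 1):
print(corrected_df(1.0, 5, 81, 4))
# With an illustrative epsilon of 0.5, both values are halved:
print(corrected_df(0.5, 5, 81, 4))
```

Because the F statistic itself is unchanged, the correction only makes the test more conservative by referring it to a distribution with fewer degrees of freedom, which is why fractional values appear in the reported results.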
Overall means in the first interview were 9.01 (SD = 6.56) for the number of recommended questions, 13

Correlations confirming correct functioning of the simulation
The correlations between the types of questions, the types of details elicited by participants, and the assessment of information of avatars at the within-subject level are presented in Table 2. The number of recommended questions and the proportion of recommended questions were significantly positively associated with both the number of relevant details and the number of neutral details, as well as negatively associated with the number of wrong details. These patterns were reversed for the number of non-recommended questions. These within-subject correlations confirm that the algorithms worked as expected.

Preliminary analyses related to the experimental manipulation
A series of t-tests revealed no significant differences between experimental groups in their baseline performance with the exception of the number of wrong details, where we found a significant difference between the HT condition (M = 4.18, SD = 2.17) and the HT+Feedback condition (M = 1.70, SD = 2.03): t(39.97) = 3.83, p < .001, 95% CI [1.17, 3.79]. This is probably a chance effect.
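The reported test statistic can be approximately reproduced from the summary statistics alone. The sketch below assumes that an unequal-variance (Welch) t-test was used, which the fractional degrees of freedom suggest but which is not stated explicitly, and it takes the group sizes (22 and 20) from the Design section. The function name is hypothetical.

```python
from math import sqrt


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# HT (M = 4.18, SD = 2.17, n = 22) vs. HT+Feedback (M = 1.70, SD = 2.03, n = 20)
t, df = welch_t(4.18, 2.17, 22, 1.70, 2.03, 20)
print(round(t, 2), round(df, 1))
```

Small discrepancies from the published degrees of freedom are expected because the means and standard deviations entered here are already rounded to two decimals.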
Regarding the hypotheses that were constructed by participants, we found significant differences between the HT and the [...] In addition, HT showed a significant main effect only on the number of recommended questions (F(1, 77) = 5.43, p = .022, η_g² = 0.04). Significant two-way interactions were also found between Feedback and Interview number on both the number of non-recommended questions and the proportion of recommended questions (Non-recommended: F(3.63, 279.13) = 21.14, p < .001, η_g² = 0.08; Proportion: F(3.10, 238.86) = 18.99, p < .001, η_g² = 0.10) (Fig 1).

Effects of Avatar training on questioning skills and interview quality
To look at the effects of the HT and Feedback interventions, respectively, in each interview, we combined pairs of experimental conditions and conducted planned comparisons between HT conditions (i.e., the combination of HT and HT+Feedback conditions) and Non-HT conditions as well as between the Feedback conditions (i.e., the combination of Feedback and HT +Feedback conditions) and Non-feedback conditions. Compared to Non-feedback conditions, participants in Feedback conditions presented significantly more recommended questions during the 3rd, 4th, and 5th interviews, while fewer non-recommended questions in all interviews except the first one. Combined, Feedback conditions increased the proportion of presented recommended questions during the 2nd, 3rd, 4th, and 5th interviews, indicating that receiving feedback improved the questioning skills of participants over time. Compared to Non-HT conditions, receiving HT intervention made participants present fewer recommended questions during the 4th and 5th interviews (Table 3).  Planned comparisons between each pair of conditions in each number of interviews revealed that participants who were in the Feedback condition presented significantly more recommended questions and fewer non-recommended questions compared to HT during the 3rd, 4th, and 5th interviews. Compared to the Control condition, participants in the Feedback condition also presented more recommended questions during the 3rd and 5th interviews, while fewer non-recommended questions during the 3rd, 4th, and 5th interviews. Compared with HT only, combining this intervention with feedback made participants present significantly more recommended questions during the 4th interview. As expected, this significant and positive effect of receiving the combined intervention was also observed in the number of non-recommended questions compared to both HT and Control conditions during the 2nd, 3rd, 4th, and 5th interviews. 
Taken together, both the Feedback and HT+Feedback conditions increased the proportion of recommended questions during all interviews in comparison with both the HT and Control conditions (Table 4).
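The pooling of conditions behind these planned comparisons can be illustrated with a minimal sketch. It assumes that the proportion of recommended questions is computed as recommended / (recommended + non-recommended); the records, function names, and counts below are hypothetical and are not the study's data or analysis code.

```python
from statistics import mean

# Hypothetical per-interview records: (condition, recommended, non_recommended).
# These counts are invented for illustration only.
records = [
    ("Control", 10, 15),
    ("HT", 9, 18),
    ("Feedback", 16, 8),
    ("HT+Feedback", 14, 7),
]

# Pooled groupings described in the text.
FEEDBACK = {"Feedback", "HT+Feedback"}
NON_FEEDBACK = {"Control", "HT"}
HT = {"HT", "HT+Feedback"}
NON_HT = {"Control", "Feedback"}


def proportion_recommended(recommended, non_recommended):
    """Share of recommended questions among all questions (assumed definition)."""
    total = recommended + non_recommended
    return recommended / total if total else 0.0


def pooled_mean(rows, member_conditions):
    """Mean proportion of recommended questions over the pooled conditions."""
    values = [
        proportion_recommended(rec, non_rec)
        for condition, rec, non_rec in rows
        if condition in member_conditions
    ]
    return mean(values)


feedback_mean = pooled_mean(records, FEEDBACK)
non_feedback_mean = pooled_mean(records, NON_FEEDBACK)
ht_mean = pooled_mean(records, HT)
non_ht_mean = pooled_mean(records, NON_HT)
```

Each participant contributes to two pooled contrasts (Feedback versus Non-feedback, and HT versus Non-HT), which is why the HT+Feedback condition appears in both the HT and the Feedback groupings.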
For the quality of the information derived from the Avatars (i.e., the number of relevant details, the number of neutral details, and the number of wrong details, hereafter abbreviated to Relevant, Neutral, and Wrong, respectively), multivariate level significant effects were 0.04). When looking at the number of wrong details, there was also a significant interaction between HT and Feedback (F(1, 77) = 4.36, p = .040, η²G = 0.03) (Fig 1).
Compared to the Non-feedback conditions, participants in the Feedback conditions elicited significantly more relevant details during the 3rd, 4th, and 5th interviews. Additionally, the Feedback conditions yielded significantly more neutral details, as well as fewer wrong details, in all interviews except the first. However, no significant differences between the HT and Non-HT conditions were observed (Table 5).
Planned comparisons between each pair of conditions in each interview revealed that participants in the Feedback condition elicited significantly more relevant details than participants in the Control condition during the 3rd and 5th interviews. Feedback also increased the number of elicited neutral details during the 3rd interview. Compared with the HT condition, combining HT with Feedback led participants to elicit significantly more relevant details during the 4th and 5th interviews and more neutral details during the 5th interview. Additionally, this combined intervention significantly increased the number of relevant details compared to the Control condition during the 4th interview. Compared to the HT condition, participants in the Feedback condition elicited more relevant and neutral details during the 4th and 5th interviews. As for the number of wrong details, participants in the Feedback condition elicited significantly fewer wrong details than participants in the HT condition during the 3rd, 4th, and 5th interviews. Participants in the Feedback condition also elicited fewer wrong details than participants in the Control condition during the 5th interview. Moreover, compared with HT only, combining HT with the Feedback intervention significantly reduced the number of wrong details during the 3rd, 4th, and 5th interviews. The combined intervention also led participants to elicit significantly fewer wrong details than the Control condition during the 5th interview. Overall, these results illustrate the importance of Feedback for improving interview quality: it led to more relevant and neutral details, as well as fewer wrong details, as a function of the number of interviews (Table 6).
In sum, the Feedback intervention (rather than the HT intervention) improved the interviewing quality and the quality of the information derived from the avatars. These results did not support Hypothesis 1 but provided support for Hypothesis 2. Compared to the HT condition (but not the Feedback condition), the combination of HT and Feedback resulted in more improvement in interview quality. Hence, Hypothesis 3 was partially supported. Since the hypothesis-testing intervention did not have the expected effect, Hypothesis 5 was not supported. Because no effect of HT was found, we were not able to perform a mediation analysis to test Hypothesis 6 concerning the potential mediating role of improvement in questioning skills in the impact of the hypothesis-testing intervention on interview quality variables.

Effects of Avatar training on the assessment of the avatars
For the assessment of the Avatars (i.e., perceived reliability, contradictoriness, and suggestibility of the information from the avatars, hereafter abbreviated as Reliability, Contradictoriness, and Suggestibility, respectively), multivariate level significant effects were found of the   Compared to the Non-feedback conditions, participants in the Feedback conditions perceived the information as more reliable during the 3rd, 4th, and 5th interviews and as less contradictory during all interviews except the first. Receiving the Feedback intervention also made participants perceive the avatars as less suggestible than participants in the Non-feedback conditions did during the 2nd, 3rd, 4th, and 5th interviews. However, there was no significant difference between the HT and Non-HT conditions in the assessment of information (Table 7).
Planned comparisons between each pair of conditions in each interview revealed that participants in both the Feedback and HT+Feedback conditions perceived the information from the avatars as more reliable, and as less contradictory and less suggestible, than participants in the HT condition during the 5th interview. We also found that participants in the HT+Feedback condition reported higher perceived reliability of the information than participants in the HT condition during the 4th interview (Table 8).

Effects of Avatar training on the correctness of conclusions
There were two variables related to correct conclusions reached by participants including (1) whether the abuse was present or not (i.e., binary choices) and (2) correctness of complete

Correlations between hypotheses testing, types of questions, and types of details
The correlations between the hypotheses built and the types of questions, as well as the types of elicited details, at both the within-subject and between-subject levels are presented in Table 9. At the within-subject level, the total number of hypotheses (without excluding non-analytic hypotheses) was negatively associated with the number of recommended questions, the proportion of recommended questions, and the number of neutral details. More specifically, the numbers of both non-analytic CSA and non-CSA hypotheses showed similar patterns. These results suggest that the more hypotheses the interviewers formed, the worse the quality of the interview was. After excluding the non-analytic hypotheses, a negative association remained only between the number of non-CSA hypotheses and the number of recommended questions. However, a different pattern was found at the between-subject level: participants who were able to form more analytic non-CSA hypotheses asked more recommended questions and elicited more relevant details compared to participants who formed fewer non-CSA hypotheses. This suggests that participants with a greater ability to form analytic hypotheses performed better at interviewing compared with their peers who were less capable of forming hypotheses. However, at the interview level, the more hypotheses, including the non-analytic ones, the participant formed, the worse the interview performance was.
Table 9. a, f Within-subject and between-subject correlations between types of questions, types of details, and hypotheses built.
The lower triangular of the matrix presents the within-subject correlation coefficients and the upper triangular of the matrix presents the between-group correlation coefficients.
b Based on the condition to which participants were randomly allocated, only 42 (of 81 in total) participants constructed hypotheses before conducting each interview. Hence, the correlations involving Hypotheses (i.e., items 7, 8, 9, and 10) were analyzed based on the sample sizes n(within-subject) = 210 and n(between-subject) = 42.
c A ratio of (non-) CSA hypotheses to total hypotheses: Calculated as the number of (non-) CSA hypotheses divided by the total number of hypotheses, based on the presence (or absence) of the sexual abuse (e.g., if the sexual abuse was present, a ratio of CSA hypotheses = CSA hypotheses/Total hypotheses). If participants did not construct any hypothesis before the interview, we coded the ratio as 0.
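The contrast between the within-subject and between-subject levels can be illustrated with a minimal sketch. It assumes one common decomposition, in which the between-subject correlation is computed on participant means and the within-subject correlation on deviations from each participant's own mean; the section does not state which procedure the authors used, and the data below are invented so that the two levels show opposite signs. The ratio helper follows the coding rule in note c of Table 9.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Plain Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: participant -> (hypotheses built, recommended questions)
# for each interview. Values are invented for illustration only.
data = {
    "p1": [(1, 8), (2, 7), (3, 6)],
    "p2": [(2, 12), (3, 11), (4, 10)],
    "p3": [(3, 16), (4, 15), (5, 14)],
}


def between_subject_r(observations):
    """Correlate participant-level means (one point per participant)."""
    xs = [mean(h for h, _ in obs) for obs in observations.values()]
    ys = [mean(q for _, q in obs) for obs in observations.values()]
    return pearson(xs, ys)


def within_subject_r(observations):
    """Correlate deviations from each participant's own mean (one point per interview)."""
    xs, ys = [], []
    for obs in observations.values():
        mh = mean(h for h, _ in obs)
        mq = mean(q for _, q in obs)
        xs.extend(h - mh for h, _ in obs)
        ys.extend(q - mq for _, q in obs)
    return pearson(xs, ys)


def hypothesis_ratio(n_target, n_total):
    """Ratio of (non-)CSA hypotheses to total hypotheses, coded 0 when none were built."""
    return n_target / n_total if n_total else 0


r_between = between_subject_r(data)
r_within = within_subject_r(data)
```

In this toy data set, participants who build more hypotheses on average also ask more recommended questions on average, while within each participant the interviews with more hypotheses contain fewer recommended questions. The two coefficients therefore answer different questions and need not agree in sign.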

Correlations between interview quality and perception of avatars
As for the associations between interview quality and perceptions of avatars in each interview (Table 10), within-subject correlations showed that the number (and proportion) of recommended questions were positively associated with the perceived reliability of information from avatars. In addition, the proportion of recommended questions was also negatively associated with perceived contradictoriness as well as perceived suggestibility. The number of non-recommended questions was positively associated with contradictoriness while negatively associated with perceived reliability. When looking at the details elicited by participants, the number of relevant details was positively associated with perceived reliability, negatively associated with perceived suggestibility as well as perceived contradictoriness. Neutral details only had a significant positive correlation with perceived reliability. The number of wrong details had significant positive associations with perceived suggestibility and contradictoriness. In sum, these results suggest that interviewers were able to form a generally accurate appraisal of interview quality consistent with the more objective indicators such as the number of recommended questions or the number of relevant details.

Correlations between interview quality and correctness of conclusions
Notably, when looking at the correctness in concluding the presence (or absence) of the abuse and the correctness of complete conclusions (Table 10), there was only a statistically significant positive correlation with the number of relevant details at the within-subject level, indicating that, in each interview, eliciting more relevant details made the given participant more likely to reach the correct conclusion. Besides the binary choices, participants were asked to offer complete accounts of what happened to the avatars. As expected, the number (and proportion) of recommended questions, as well as the number of relevant details, were positively associated with the correctness of complete conclusions, whereas the number of non-recommended questions showed the reverse pattern. These results illustrate that the more recommended questions were asked or relevant details were elicited within each interview, the more likely a given participant was to reach an accurate account of the abuse allegation.

Discussion
The current study was the first to examine the efficacy of Avatar Training in improving interview quality in the Chinese context. Consistent with previous findings [26][27][28][30,33], feedback (on both question types and case outcomes) showed positive effects on both the questioning skills of the interviewers and the amount of information elicited from the avatars, the latter finding being a function of the algorithms driving the avatars. Apart from this, our main focus was to look at the effects of hypothesis-testing on interviewing behavior. Contrary to our prediction, building hypotheses before each interview did not show the expected positive effect on interview quality. However, we did notice that incorporating hypothesis-testing into Avatar Training did not reduce the effectiveness of feedback. Finally, the current study made a further contribution to CSA interviewing research by examining interviewers' perceptions of the avatars' reliability and their associations with questioning skills and the information elicited. In sum, the following hypotheses were not supported in the current study:
Hypothesis 1. HT was expected to improve interviewing quality (i.e., question types used), and consequently the quality of the information derived from the Avatars.
Hypothesis 4. All three intervention groups were expected to perform better than the control group, which received neither intervention (i.e., no feedback and no HT).
Hypothesis 5. The total number of hypotheses formulated by the participants, the number of non-CSA hypotheses, and the ratio (more non-CSA compared to CSA hypotheses) were expected to predict both better question quality and better quality of information derived from the Avatars (these analyses were limited to the HT conditions).
Hypothesis 6. Improvement in interviewing quality was expected to mediate the impact of the HT manipulation on interview quality variables (eliciting details).
On the other hand, Hypothesis 2 (feedback on outcome and question types was expected to improve interviewing quality, and consequently the quality of the information derived from the Avatars) was supported. Hypothesis 3 (the combination of hypothesis-testing and feedback was expected to result in more improvement in interview quality than either intervention alone) was partially supported. In the following sections, we briefly discuss the findings, their implications, and the limitations of the current research.

Avatar training with feedback as a reliable tool to improve interview quality
As noted above, we found that providing feedback significantly improved questioning skills, which led to better information gathering, that is, more relevant and neutral details and fewer wrong details being elicited from the Avatars. Because interviewing children who were allegedly sexually abused is a complex skill, training does not necessarily result in sustained improvements in the absence of continuous and timely feedback [17,25,75,76]. Our results again confirm that feedback is essential to shaping interviewer behavior. These results also mean that Avatar Training, accompanied by feedback on interviewers' behaviors and case outcomes, can be administered via the Internet and that its effect is essentially the same in this Chinese version of Avatar Training.
Regarding the correctness of the conclusions drawn by the participants, we found that participants who asked more recommended questions reached more completely correct conclusions. Using recommended questions elicits more useful details from the avatars, making it possible to formulate complete accounts of what happened to the avatar. However, the interventions did not significantly influence the correctness of the conclusions. This result was inconsistent with previous Avatar Training studies, which have demonstrated that feedback positively influenced the probability of correct conclusions [26][27][28][29][30][33]. For example, Pompedda et al. [30] found that the percentage of correct conclusions was higher in the group receiving both types of feedback (outcome and process; 29%) than in the group receiving feedback on process (21%), the group receiving feedback on the outcome (15%), and the control group (17%). Compared to the 20% of completely correct conclusions obtained in the study of Pompedda et al. [30], only 14.8% arrived at completely correct conclusions in the current study. The percentage was higher in the group receiving feedback alone (18.9%) than in the group receiving the hypothesis-building intervention (10%) and the group receiving both interventions (12%). The percentage in the control group was 18%. These differences were not significant.
The investigative interview is a complex practical skill that places high demands on interviewers [77]. When faced with a challenging task, decision-makers tend to expend as little cognitive effort as possible and opt for a simplifying way to make decisions [78,79]. That is to say, participants might have made judgments too quickly without analyzing the gathered information comprehensively, hence undermining the correctness of conclusions. In the current study, interviewers received feedback on both the outcomes of the case and the question types they used, but no information on the accuracy of details. This feedback intervention improved interview quality (i.e., interviewers asked more recommended questions). However, it may not have established a connection between the elicited details and the correctness of conclusions. In future studies, a potentially useful way to help CSA interviewers draw correct conclusions is to provide feedback on whether the details elicited from avatars were accurate.

Hypothesis-testing: A not-yet-successful implementation
The idea behind the hypothesis-testing approach, as elaborated in Korkman et al. [50], is to combat confirmation bias by building alternative hypotheses using information related to the case at hand and then evaluating these hypotheses against information obtained from the interview and other sources. Contrary to our prediction, building hypotheses alone did not have a positive effect on questioning skills, types of elicited information, or the ability to conclude whether avatars had been sexually abused. Summed over interviews, compared to participants who did not receive any intervention, we found a non-significant trend that building hypotheses prior to the interview promoted the use of non-recommended questions (Control versus HT: M = 15.25, SD = 8.68 versus M = 17.31, SD = 7.95), and interviewers elicited more wrong details from Avatars in the HT only condition (HT: M = 4.69, SD = 2.55) than in the control condition (Control: M = 3.52, SD = 2.35). While the simulated training with feedback is meant to promote a child-led interview, the hypothesis-testing intervention requires interviewers to take on a more active role. As mentioned before, there is potentially some tension between hypothesis-testing and the use of open questions. It is natural for interviewers to ask closed questions because doing so may help test hypotheses effectively once the interviewee's age and cognitive capacity have been taken into consideration. On the other hand, an interview dominated by non-recommended questions, as was typical of the interviews conducted by the hypothesis-testing group (see Fig 1 Panel C), is far from ideal for conducting a high-quality interview with young children according to existing studies [48,49,80].
However, the current results do not necessarily mean that hypothesis-testing per se lacks efficacy. In our experiment, we did not assess whether the actual questions asked tested both abuse and non-abuse hypotheses in the hypothesis-testing conditions. It is thus possible that the HT group did conduct an interview that tested hypotheses in a balanced manner albeit not using open questions. So far no studies have assessed the efficacy of such an approach to CSA interviewing. However, we did not find any differences between the groups in conclusion correctness suggesting that this approach was neither clearly superior nor inferior to the other experimental groups. Moreover, in the present study, the interviewers lacked relevant experience in CSA interviews and the hypothesis-testing approach. Previous research evaluating professionals' and laypersons' knowledge of interviewing children pointed out that students are less knowledgeable than professionals, especially uninformed when considering the negative impact of leading questions [81]. However, in the context of CSA investigative interviewing, there is no convincing evidence of the positive effects of experience alone on interview quality [28,35,46]. Therefore, we refrain from making strong claims that the current results are due to inexperience.
Although a brief guideline on the importance of employing only recommended (e.g., open-ended) and non-leading questioning techniques was provided to the participants before the first interview, it seemed that building hypotheses alone may unintentionally shape the questioning styles of interviewers in a negative direction, leading to an increased number of non-recommended questions [44,82,83]. We were also interested in understanding whether learning questioning skills and the hypothesis-testing approach simultaneously introduced too much new information to the interviewers, with a negative impact on their learning outcomes. However, when combined with feedback, the negative effect of hypothesis-testing on questioning skills and interviewing quality was mitigated. These findings support the notion that feedback not only reinforces a correct questioning style, but also provides interviewers with the opportunity to continuously assess and improve their questioning style.
The combined interventions group obtained similar numbers of relevant, neutral, and wrong details as the Feedback group, but reached fewer completely correct conclusions. The tasks of the feedback group and the combined interventions group differed. One possible explanation is that, when explaining what happened in the cases, the feedback group provided a report based on the information the avatar shared with them, whilst the combined interventions group had to compare the information elicited from the avatar with their hypotheses before they arrived at a judgment. It may be that the information elicited was insufficient to allow for differentiating between the hypotheses and reaching a coherent conclusion.
Interestingly, in correlation analyses, we found that participants who built more good quality hypotheses tended to ask more recommended questions. This positive association was primarily derived from an increase in non-CSA hypotheses built. The ability to build non-CSA hypotheses was also related to the average number of recommended questions presented by participants. These are encouraging findings and in line with our overall expectations, although not based on a priori hypotheses.

Perceived reliability of the avatar depends on who is asking the question
On average, interviewers tended towards perceiving the child avatars as better rather than worse witnesses. Compared with conditions without feedback, participants in conditions with feedback perceived the Avatars to be more reliable and less suggestible and believed that the avatars showed fewer contradictions. These differences were evident in the second half of the interviews. Unsurprisingly, the proportion of recommended questions and the number of correct details were positively associated with perceived reliability and negatively associated with perceived suggestibility and contradictoriness. The pattern was the opposite for the number of non-recommended questions and the number of wrong details. The above results suggest that interviewers were able to form a fairly accurate appraisal of the interview process. To elaborate, based on the predefined algorithms, if the interviewer asked a broad-invitation question (e.g., "Tell me everything you still remember"), the Avatar would respond to this question by retrieving narrative details from their "memory". Then, if the interviewer continued with a facilitator question (e.g., "And then?") and encouraged the Avatar to keep talking based on what they had mentioned before, the Avatar would continue to provide more correct details in their responses. Open-ended questions made the avatars draw information from "recall memory" rather than just selecting "yes" or "no" in response to closed-ended questions, which could have introduced false details contradicting other information [15,17,84]. Hence, the perception that the avatars were less reliable and more susceptible when the interviewers employed a larger proportion of non-recommended questions is to be expected, and our results confirm that it is noticeable to the participants.
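The response behavior described above can be illustrated with a minimal sketch. This is not the algorithm that drives the avatars, which this section does not specify; the class name, the question-type labels, and the probability `p_wrong_yes` are all hypothetical and serve only to show why open questions yield correct details while closed questions can introduce wrong ones.

```python
import random


class AvatarSketch:
    """Toy avatar: open questions draw on memory, closed questions yield yes/no.

    All names and the probability parameter are hypothetical illustrations.
    """

    def __init__(self, memory_details, p_wrong_yes=0.5, seed=0):
        self.memory = list(memory_details)   # details the avatar "remembers"
        self.next_index = 0                  # next narrative detail to disclose
        self.p_wrong_yes = p_wrong_yes       # assumed chance of agreeing with a false suggestion
        self.rng = random.Random(seed)

    def answer(self, question_type, suggested_detail=None):
        """Return (response text, detail type) for one interviewer question."""
        if question_type in ("invitation", "facilitator"):
            # Open questions retrieve the next narrative detail from memory.
            if self.next_index < len(self.memory):
                detail = self.memory[self.next_index]
                self.next_index += 1
                return detail, "correct"
            return "I don't remember anything else.", "none"
        if question_type == "closed":
            # Closed questions only allow a yes/no choice about the suggestion.
            if suggested_detail in self.memory:
                return "Yes", "correct"
            if self.rng.random() < self.p_wrong_yes:
                return "Yes", "wrong"        # false detail enters the record
            return "No", "none"
        raise ValueError(f"unknown question type: {question_type}")
```

Under these assumptions, an interview built from invitations and facilitators can only accumulate details that are in the avatar's memory, whereas each closed question about something that did not happen carries some risk of producing a wrong detail that may contradict earlier answers.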
As succinctly summarized by Denne et al. [85], the way in which we question the children about their memories of abuse is a strong predictor of their ability to communicate, and further forms the foundation of their reliability. Our study provides direct evidence for this proposition. On the one hand, it is encouraging that interviewers could form somewhat accurate reliability judgments. On the other hand, the current study offers no insight into whether interviewers are cognizant of the fact that the reliability of the child is linked to their questioning style. In hindsight, it would also be informative to ask interviewers to rate their own interview performance and examine the correlations between self-appraisal and the perceptions of the child avatar.

Differences between the current and previous studies in the number of questions used
The total number of questions asked by participants in the current study was, on average, 25, which is fewer than the average of 40 asked by European participants [30]. Japanese participants asked around 16 questions on average [27]. Haginoya et al. [27] argued that asking fewer questions of any type could be a strategy Japanese participants adopted to avoid making mistakes in public because of their high sensitivity to negative feedback. However, this might not be the case for the current study. There were no significant differences in the total number of questions when comparing the feedback group to the hypothesis-testing group and the control group. Differences in the total number of questions asked were only observed when comparing the combined interventions group to the other three groups, with the combined interventions group asking the fewest questions. A possible explanation is that dual tasks are cognitively more demanding than a single task: when deciding what to ask, the combined interventions group had to ensure that they tested hypotheses with open questions, while the other groups could simply keep testing hypotheses with no restriction on question types (hypothesis-testing group), focus solely on using more open questions and fewer closed-ended questions as reminded by feedback (feedback group), or keep asking whatever they believed could help to find out the truth (control group). The lower total number of questions asked in the current study compared to previous studies can be explained by the use of fewer non-recommended questions. Participants, in general, used similar numbers of recommended and non-recommended questions.
While the number of recommended questions in the current study (M = 12.11, SD = 6.93) is similar to that observed (M = 13.52, SD = 7.96) in earlier studies (e.g., see [30]), the number of non-recommended questions (M = 12.38, SD = 8.09) was about half of that in an earlier study (M = 26.39, SD = 15.66). Interestingly, Zeng et al. [86] found that Chinese police officers used a large proportion of open questions (68.9%) when interviewing suspects. The use of open questions differed across three phases: 83.7% during the opening phase, 67.2% during the information-gathering phase, and 43.4% during the closing phase. However, the participants in the current study were university students, which would suggest that a general interaction style specific to the Chinese may underlie the finding.
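The reported means can be checked against the totals with a few lines of arithmetic. The sketch below uses only the means quoted in the text; the dictionary and variable names are illustrative, and the sums are approximations because the reported totals (25 and 40) are rounded.

```python
# Mean numbers of questions per interview as reported in the text.
current_study = {"recommended": 12.11, "non_recommended": 12.38}
earlier_study = {"recommended": 13.52, "non_recommended": 26.39}

# Sum of the two question types approximates the reported average totals.
total_current = sum(current_study.values())
total_earlier = sum(earlier_study.values())

# Ratio of non-recommended questions, current study relative to the earlier one.
non_recommended_ratio = (
    current_study["non_recommended"] / earlier_study["non_recommended"]
)

# Gap in totals and the part of it attributable to non-recommended questions.
gap_total = total_earlier - total_current
gap_non_recommended = (
    earlier_study["non_recommended"] - current_study["non_recommended"]
)

print(round(total_current, 2), round(total_earlier, 2))
print(round(non_recommended_ratio, 2))
print(round(gap_non_recommended / gap_total, 2))
```

The sums come to roughly 24.5 and 39.9 questions, in line with the reported averages of about 25 and 40, and nearly all of the gap between the two totals comes from non-recommended questions.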

Limitations and future directions
There are a few more limitations in addition to the one just mentioned above. First, the implementation of the hypothesis-testing procedure was not optimal in the current study. Many of the hypotheses were unfortunately not analytical. Second, given that hypothesis-testing is a cognitively demanding task, it could take the trainees more than two practice rounds to obtain an adequate level of skill. Therefore, the training of hypothesis-testing itself may require further investment to be effective.
The current study lacked sensitivity to elements other than question type when measuring interview quality. The exact questions the interviewers asked were not recorded, so we did not know whether the quality of the questions differed between the combined interventions group and the groups that received a single intervention. Thus, we recommend that future studies look specifically into the questions being asked, as well as the self-reported strategy for managing two tasks, to gain a better understanding of how hypothesis-testing is executed in interviewer performance, that is, whether the question itself allows a hypothesis to be tested. Future studies should also examine how interviewers, despite improved questioning skills and good quality information elicited from the Avatars, arrived at so few completely correct conclusions.

Conclusions
Taken together, the current study provides further evidence that Avatar Training with feedback is an effective approach for CSA interview training across cultural contexts. Feedback significantly increases the use of open-ended questions and decreases the use of closed-ended questions, which in turn improves the accuracy of information gathered from the child avatars. Although we failed to observe any positive effects of hypothesis-testing on interviewing quality, we caution against the interpretation that hypothesis-testing lacks efficacy, as interview quality was only measured in terms of question types. Moreover, we provide direct evidence that children's perceived reliability depends on interviewers' questioning style and encourage researchers to further explore interviewers' perceptions and judgments about their interview process.